Add support for writing timestamps without timezone. #1
Conversation
This also adds AssertJ to testCompile in all modules so assertions can be used elsewhere.
* Spec: Add identifier-field-ids to schema.
* Spec: Add section for partition evolution.
* Spec: Add schemas list and current-schema-id to table metadata.
* Spec: Add key_metadata to manifest list.
* Spec: Add schema-id to Snapshot metadata.
`.withFailMessage(..)` was mistakenly used and was therefore overriding the actual error reporting, making debugging difficult.
…#2689)
* support custom target name in partition spec builder
* address the comments.
…org (apache#2709) Co-authored-by: tgooch <tgooch@netflix.com>
Param scanAllFiles is used to check whether all the data files should be processed, or only added files. Here we should replace scanAllFiles with !scanAllFiles.
Co-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: tgooch <tgooch@netflix.com>
```java
public static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE_SESSION_PROPERTY =
    "spark.sql.iceberg.handle-timestamp-without-timezone";

public static final String STORE_TIMESTAMP_WITHOUT_TIMEZONE_SESSION_PROPERTY =
```
I think it's fine to just use one property for both reading and writing
Actually, this property had an incorrect name; I changed it to READ_TIMESTAMP_AS_TIMESTAMP_WITHOUT_TIMEZONE.
This property controls how the Spark TimestampType is represented in Iceberg. By default, Spark's TimestampType is converted to the Types.TimestampType.withZone() Iceberg type, but if READ_TIMESTAMP_AS_TIMESTAMP_WITHOUT_TIMEZONE is set to true, the Spark timestamp type will be converted to Types.TimestampType.withoutZone().
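The toggle described above can be sketched with simplified stand-in types (these are not the actual Iceberg classes; `IcebergTimestamp` and `convertSparkTimestamp` are hypothetical names used only for illustration):

```java
// Minimal sketch, assuming a boolean session flag selects the Iceberg type
// that Spark's TimestampType maps to. Not the actual Iceberg implementation.
public class TimestampTypeChoice {
    // Stand-in for Types.TimestampType.withZone() / withoutZone()
    enum IcebergTimestamp { WITH_ZONE, WITHOUT_ZONE }

    static IcebergTimestamp convertSparkTimestamp(boolean readAsWithoutZone) {
        // Default: Spark TimestampType maps to timestamp with zone (timestamptz);
        // the session property flips the mapping to timestamp without zone.
        return readAsWithoutZone ? IcebergTimestamp.WITHOUT_ZONE
                                 : IcebergTimestamp.WITH_ZONE;
    }

    public static void main(String[] args) {
        System.out.println(convertSparkTimestamp(false)); // WITH_ZONE
        System.out.println(convertSparkTimestamp(true));  // WITHOUT_ZONE
    }
}
```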
```java
 */
public static boolean hasTimestampWithoutZone(Schema schema) {
  return TypeUtil.find(schema, t ->
      t.typeId().equals(Type.TypeID.TIMESTAMP) && !((Types.TimestampType) t).shouldAdjustToUTC()
```
could just check against TimestampType.withoutZone() which returns the singleton instance
good point, changed the code
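The suggested singleton check can be sketched with a stand-in class (this is not the real `Types.TimestampType`; the nested class below only mimics its singleton `withZone()`/`withoutZone()` accessors):

```java
// Sketch of why comparing against the singleton is equivalent to checking
// shouldAdjustToUTC(): withoutZone() always returns the same instance.
public class SingletonCheck {
    // Hypothetical stand-in for Iceberg's Types.TimestampType
    static final class TimestampType {
        private static final TimestampType WITH_ZONE = new TimestampType(true);
        private static final TimestampType WITHOUT_ZONE = new TimestampType(false);
        private final boolean adjustToUTC;

        private TimestampType(boolean adjustToUTC) { this.adjustToUTC = adjustToUTC; }

        static TimestampType withZone() { return WITH_ZONE; }
        static TimestampType withoutZone() { return WITHOUT_ZONE; }
        boolean shouldAdjustToUTC() { return adjustToUTC; }
    }

    public static void main(String[] args) {
        TimestampType t = TimestampType.withoutZone();
        // Field-based check and singleton comparison agree:
        System.out.println(!t.shouldAdjustToUTC());           // true
        System.out.println(t == TimestampType.withoutZone()); // true
    }
}
```

Because the factory methods hand out singletons, an identity comparison (`t == TimestampType.withoutZone()`) is both clearer and cheaper than casting and checking `shouldAdjustToUTC()`.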
```java
class SparkTypeToType extends SparkTypeVisitor<Type> {
  private final StructType root;
  private int nextId = 0;
  private final boolean useTimestampWithoutZone;
```
I'm not sure what this flag is for. I think it's probably safer to always return TimestampType.withZone from here and handle the mismatch outside of this code.
Moved this logic to SparkFixupTimestampType.java
```java
public class SparkUtil {

  public static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE_FLAG = "spark-handle-timestamp-without-timezone";
```
Actually, we could probably just use the same session property here, so we have only one property:
`spark.sql.iceberg.convert-timestamp-without-timezone`
Or something like that.
```java
    return SparkOrcValueWriters.decimal(primitive.getPrecision(), primitive.getScale());
  case TIMESTAMP_INSTANT:
  case TIMESTAMP:
    return SparkOrcValueWriters.timestampTz();
```
Do we need corresponding code in the ParquetWriter as well?
```java
private StructType lazyType() {
  if (type == null) {
    Preconditions.checkArgument(readTimestampWithoutZone || !SparkUtil.hasTimestampWithoutZone(lazySchema()),
```
Actually cancel what I said about those other spots, this seems like a great place to check for whether we are allowed to do the TZ conversion
ok, so we don't need any changes here?
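The guard discussed above (fail fast when the schema contains timestamp-without-zone fields but the session property does not allow reading them) can be sketched in plain Java, without the Guava `Preconditions` dependency. The method name and message are hypothetical, for illustration only:

```java
// Minimal sketch of the lazyType() precondition: reject schemas with
// timestamp-without-zone fields unless the session property permits them.
public class TimestampGuard {
    static void checkTimestampReadable(boolean readTimestampWithoutZone,
                                       boolean schemaHasTimestampWithoutZone) {
        if (!readTimestampWithoutZone && schemaHasTimestampWithoutZone) {
            throw new IllegalArgumentException(
                "Cannot handle timestamp without timezone fields in Spark; " +
                "enable the session property to read them");
        }
    }

    public static void main(String[] args) {
        checkTimestampReadable(true, true);   // property enabled: allowed
        checkTimestampReadable(false, false); // no such fields: allowed
        try {
            checkTimestampReadable(false, true); // disallowed combination
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Centralizing the check at the point where the Spark type is first materialized means the other call sites do not need their own flag handling.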
…ted level should be one of the following: 6, 8.
…comments. Fixed code formatting
Force-pushed from e869a15 to bc316c4
I think I need to close this PR; the main PR is apache#2757.
Add a `spark.sql.iceberg.store-timestamp-without-zone` Spark config to indicate which Iceberg type (`Types.TimestampType.withZone()` or `Types.TimestampType.withoutZone()`) will be used for the Spark `TimestampType` type.